Abstract - The use of methods borrowed from statistics and physics to analyze written texts has
allowed the discovery of unprecedented patterns of human behavior and cognition by establishing
links between model features and language structure. While current models have been useful to
unveil patterns via analysis of syntactical and semantical networks, only a few works have probed
the relevance of investigating the structure arising from the relationship between relevant entities
such as characters, locations and organizations. In this study, we represent entities appearing
in the same context as a co-occurrence network, where links are established according to a null
model based on random, shuffled texts. Computational simulations performed on novels revealed
that the proposed model displays interesting topological features, such as the small-world feature,
characterized by high values of clustering coefficient. The effectiveness of our model was verified
in a practical pattern recognition task in real networks. When compared with traditional word
adjacency networks, our model achieved better results in identifying unknown references in
texts. Because the proposed representation plays a complementary role in characterizing
unstructured documents via topological analysis of named entities, we believe that it could be
useful to improve the characterization of written texts (and related systems), especially if
combined with traditional approaches based on statistical and deeper paradigms.


Introduction. — The study of complex networks emerged in the beginning of the last decade as a
powerful, robust and general representation of a myriad of real complex systems. Biological
systems, transportation networks, communication and information networks are examples of real
systems whose underlying properties were unraveled via graph representation. The investigation
of many complex systems in terms of their topological structure and dynamical behavior allowed
the discovery of non-trivial connectivity/functional patterns, including the discovery of specific
mesoscopic, heterogeneous and hierarchical structures responsible for particular functions.

Even though several complex systems modeled as networks are naturally visualized as a graph
(e.g. street or citation networks), many others undergo a pre-processing step to allow a proper
networked representation. This is the case of instances represented in an attribute space, images
modelled as networks and texts represented as graphs. Of particular interest to the aims of this
study are the text networks, a model that has drawn the attention of physicists in recent years
and has been of paramount relevance to shed light on the structure and function of linguistic and
cognitive processes.


In addition, such textual representations have allowed the investigation of basic human behavior
through the automatic analysis of the ever increasing amount of electronic data in social and
information networks.


Many physical models of texts have been proposed to tackle a diversity of problems. The
representation of unstructured documents depends on the specific language properties being
studied. If writing style or language identification is relevant, syntactical networks are used
to grasp style-dependent features [10]. In syntactical networks, each word is mapped into a node
and edges are established according to syntactical relations, which are language-dependent
constraints. Interestingly, it has been shown that such networks share the same topological
properties of networks representing completely different systems [10], thus allowing the
construction of optimized structures for language acquisition and communication [11].


Usually, syntactical networks are analyzed using a simplified representation, the so-called word
adjacency model [5, 12-14]. In this model, words are linked if they appear as neighbors in the
text. It has been shown that, despite this seemingly naive simplification, word adjacency
networks are efficient representations for text analysis because most of the syntactical
connections occur between neighboring words [10]. The usefulness of such models has been verified
in many theoretical and practical studies. Finally, another important class of text networks are
the semantical networks, where links are established if concepts share some semantical property
(e.g. a semantical similarity or an entailment relationship) [15].

Diego Raphael Amancio

Whilst several available methods grasp the relationship between all words or the relationship
between specific classes of words [16, 17] as a co-occurrence graph (see Section S4 of the
Supplementary Information (SI)¹), only a few studies have investigated the properties of entity
co-occurrences as a complex network. Note that this is a relevant issue because the presence of
words other than relevant entities in networked models may hinder the accurate recognition of
novel patterns. In this sense, this study aims at creating a networked textual representation
that analyzes the topology emerging from the relationship between specific words, namely the
named entities, which are a class of particular concepts that has been useful to shed light on
several language properties [18]. Particularly, we consider that words represent named entities
whenever they name people (characters), places or organizations. Hence, unlike traditional
textual networked models, the proposed representation emphasizes the complexity of the document
plot rather than linguistic styles.

The application of the proposed network model that links entities co-occurring in the same
context allowed the identification of some interesting topological patterns. Qualitatively, we
have found that named entity networks share topological features of other complex systems, as
revealed by low values of typical shortest path length and high local clustering. In addition,
we have found that the networks display a modular structure. The potential of the proposed
representation was verified in a typical pattern recognition problem devoted to the resolution
of co-references. Better results were found with our model than with the baseline, which confirms
that the information provided by the representation goes beyond simple word frequency statistics.
Because named entity networks are a complementary representation of texts as networks, we believe
that they could be useful to improve the characterization of written texts and related systems
with nodes playing special roles.

Extracting networks from books. — The objective of this paper is to propose a simple but relevant
model that captures co-occurrences of named entities in texts. To identify pertinent entities, we
used the technique referred to as “named entity recognition”, which identifies relevant persons,
locations and organizations in documents. Specifically, we used the method devised in [19], which
tags words with three possible labels. An example of entity recognition is shown in the following
extract obtained from a modified Wikipedia article². Note that the entities recovered by the
method are highlighted in italic font:
“Barack Hussein Obama II (US, born August 4, 1961) is the 44th and current President of the US,
and the first African American to hold the office. Born in Honolulu, Hawaii, Obama is a graduate
of Columbia University and Harvard Law School, where he served as president of the Harvard Law
Review. He was a community organizer in Chicago before earning his law degree. He worked as a
civil rights attorney and taught constitutional law at University of Chicago Law School from 1992
to 2004. He served three terms representing the 13th District in the Illinois Senate from 1997 to
2004, running in 2000 unsuccessfully for the United States House of Representatives.”

In the above example, the recognized entities denote persons (e.g. Obama), locations (e.g.
Chicago and Hawaii) and organizations (e.g. Harvard Law School or Columbia University). After the
entity recognition step, the text undergoes a pre-processing step to eliminate ambiguities and
name duplications. In this step, for example, Emma Woodhouse and Emma are mapped into the same
entity. After this pre-processing phase, a set $R = \{v_1, v_2, \ldots\}$ of entities is obtained
for each book. Note that, in this phase (training phase), only the known entities (such as Obama
and Chicago) are represented as nodes. Unknown entities (such as He) are included in a
classification phase, for example, in the anaphora resolution problem (see “Results and
discussion” section). To create the network of related entities, hereafter referred to as named
entity (NE) network, the proposed method connects entities appearing in the same context. Thus,
entities sharing some semantical relationship tend to be connected. The motivation of this
approach relies on the fact that concepts appearing in the same context tend to be semantically
related [20]. The separation of the text into contexts is accomplished by splitting the entire
document into shorter subtexts comprising the same number of tokens W. As a consequence, each
book is represented by the set $\mathcal{S} = \{S_1, S_2, \ldots\}$, where $S_i$ is the i-th
subtext. Note that we could have chosen as a representative context the segments of text
structured as chapters or another natural textual structure. We have not used this structural
information because it is not readily available in all books of the dataset. To store the
information concerning the co-occurrence of distinct entities in the same subtext, the matrix B
is created. If entity $v_i$ appears in the j-th subtext, then $B_{ij} = 1$; otherwise
$B_{ij} = 0$. The frequency of entities in subtexts, i.e. the number of subtexts in which an
entity appears, is defined as $f_i = \sum_j B_{ij}$. Analogously, the frequency of co-occurrence
of two entities $v_i$ and $v_j$ is defined as

$$f_{ij} = \sum_k B_{ik} B_{jk}. \qquad (1)$$

¹ The Supplementary Information (SI) is available from this link.

² en.wikipedia.org/wiki/Barack_Obama
The link between two entities is established if they co-occur in at least one subtext
$S_i \in \mathcal{S}$. The weight (i.e. the strength) of the link is computed as

$$w_{ij} = \min\{P(v_i|v_j), P(v_j|v_i)\}, \qquad (2)$$

where $P(v_i|v_j) = f_{ij}/f_j$. Here, the weight $w_{ij}$ defined in equation (2) is used to
identify the strongest links, which are in turn stored in the adjacency matrix A. An example of
network construction for a fictitious dataset is shown in Fig. S3 of the SI.
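
The construction described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the toy token list and the window size are hypothetical, entities are assumed to be single, already disambiguated tokens, and the last subtext is allowed to be shorter than W. No filtering of the strongest links is performed.

```python
from itertools import combinations

def split_into_windows(tokens, window_size):
    """Split a token list into consecutive subtexts of `window_size` tokens.

    Assumption: a shorter final subtext is kept rather than discarded.
    """
    return [tokens[i:i + window_size] for i in range(0, len(tokens), window_size)]

def cooccurrence_weights(tokens, entities, window_size):
    """Return (f, f_pair, w): entity frequencies, co-occurrence counts, link weights."""
    windows = split_into_windows(tokens, window_size)
    # present[j] is the set of entities found in the j-th subtext (column j of B).
    present = [set(win) & set(entities) for win in windows]
    # f[e] is the number of subtexts in which entity e appears.
    f = {e: sum(e in p for p in present) for e in entities}
    f_pair, w = {}, {}
    for a, b in combinations(sorted(entities), 2):
        # Number of subtexts containing both entities, as in equation (1).
        f_ab = sum(1 for p in present if a in p and b in p)
        if f_ab > 0:
            f_pair[(a, b)] = f_ab
            # Minimum of the two conditional probabilities, as in equation (2).
            w[(a, b)] = min(f_ab / f[b], f_ab / f[a])
    return f, f_pair, w

# Hypothetical toy text, used only to exercise the functions.
tokens = "Emma met Harriet . Emma left . Harriet saw Elton . Emma wrote".split()
f, f_pair, w = cooccurrence_weights(tokens, {"Emma", "Harriet", "Elton"}, 4)
```

In this toy example, pairs that never share a subtext (here Elton and Harriet) receive no link, while the weight of each remaining link is bounded by the less favorable of the two conditional probabilities.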

To improve the characterization of NE networks, we also considered the significance of entity
co-occurrences. More specifically, two entities $v_i$ and $v_j$ were connected only if the
quantity $f_{ij}$ was sufficiently higher than the value expected in a null model, i.e. in a
random (shuffled) text. Given two entities $v_i$ and $v_j$, the significance of the co-occurrence
is estimated by computing how likely it is to observe at least $k = f_{ij}$ co-occurrences of
$v_i$ and $v_j$ in the null model. Equivalently, the p-value associated with the quantity
$k = f_{ij}$ is $p = \sum_{k \geq f_{ij}} p(k)$, where $p(k)$ is the probability of $k$
co-occurrences of $v_i$ and $v_j$ in the random text. To compute $p(k)$, we followed the approach
devised in [21]. If $n_1 = f_i$ and $n_2 = f_j$, $p(k)$ can be computed as

$$p(k) = \frac{(N; k, n_1 - k, n_2 - k)}{(N; n_1)(N; n_2)}, \qquad (3)$$

where $(x; y_1, \ldots, y_n)$ is a simplified notation:

$$(x; y_1, \ldots, y_n) = \frac{x!}{y_1! \cdots y_n! \, (x - y_1 - \cdots - y_n)!}. \qquad (4)$$

Equation (3) can be rewritten in a more convenient way if the notation $\{a\}_b$, defined as

$$\{a\}_b = \prod_{i=0}^{b-1} (a - i), \qquad (5)$$

is adopted for $a \geq b$. In this case, the likelihood $p(k)$ can be written as

$$p(k) = \frac{\{n_1\}_k \{n_2\}_k \{N - n_1\}_{n_2 - k}}{\{N\}_{n_2} \{k\}_k}
      = \frac{\{n_1\}_k \{n_2\}_k \{N - n_1\}_{n_2 - k}}{\{N\}_{n_2 - k} \{N - n_2 + k\}_k \{k\}_k}
      = \prod_{j=0}^{n_2 - k - 1} \frac{N - j - n_1}{N - j}
        \prod_{j=0}^{k-1} \frac{(n_1 - j)(n_2 - j)}{(N - n_2 + k - j)(k - j)}.$$

Therefore, the p-value associated with the number of observed co-occurrences is [21]

$$p = \sum_{k \geq f_{ij}} \left[ \prod_{j=0}^{n_2 - k - 1} \frac{N - j - n_1}{N - j}
      \prod_{j=0}^{k-1} \frac{(n_1 - j)(n_2 - j)}{(N - n_2 + k - j)(k - j)} \right].$$

Note that the p-value defined in the above equation can be used to establish links between
entities whose co-occurrence frequency is significant.
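
As an illustration of the significance test, the following Python sketch evaluates equation (3) with log-factorials and sums it to obtain a p-value. It rests on two assumptions that are not stated explicitly in the text: N is taken to be the total number of subtexts, and the p-value includes the observed value $k = f_{ij}$ itself. The function names are hypothetical.

```python
from math import exp, lgamma

def log_multinomial(x, parts):
    """Logarithm of (x; y_1, ..., y_n) = x! / (y_1! ... y_n! (x - sum(y))!)."""
    rest = x - sum(parts)
    return lgamma(x + 1) - sum(lgamma(y + 1) for y in parts) - lgamma(rest + 1)

def p_k(N, n1, n2, k):
    """Probability of exactly k co-occurrences in the shuffled-text null model.

    Assumption: N is the number of subtexts; n1 and n2 are the numbers of
    subtexts containing each of the two entities.
    """
    # Outside this range the number of co-occurrences is impossible.
    if k < max(0, n1 + n2 - N) or k > min(n1, n2):
        return 0.0
    log_p = (log_multinomial(N, [k, n1 - k, n2 - k])
             - log_multinomial(N, [n1])
             - log_multinomial(N, [n2]))
    return exp(log_p)

def p_value(N, n1, n2, f_ij):
    """Probability of observing at least f_ij co-occurrences in the null model."""
    return sum(p_k(N, n1, n2, k) for k in range(f_ij, min(n1, n2) + 1))
```

A link between two entities would then be kept only when `p_value` falls below a chosen significance threshold; the threshold itself is a modelling choice that this sketch does not fix.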

Results and discussion. — In this section, the topological properties of NE networks are
investigated. In addition, we apply the proposed networked representation to tackle a natural
language processing task related to anaphora (or co-reference) resolution. The dataset used in
the experiments is shown in Table S2 of the SI. The list comprises novels by distinct authors.
All books were retrieved from the Project Gutenberg dataset³.

Statistical properties of named entity networks. To probe the topological properties of NE
networks, the following quantities were computed: the number of nodes ($N$), the average degree
($\langle k \rangle$), the clustering coefficient ($\langle C \rangle$) and the average shortest
path length ($\langle l \rangle$). In order to compare the properties of NE and random networks,
the values of clustering coefficient and average shortest path length in equivalent random
networks were also computed as:

$$\langle C \rangle_r = \langle k \rangle / N, \qquad (6)$$

$$\langle l \rangle_r = \ln N / \ln \langle k \rangle. \qquad (7)$$

The results obtained for selected books of our dataset are shown in Table 1. The results for the
full dataset are shown in Table S3 of the SI. Note that, differently from traditional language
networks, where the number of nodes is proportional to the vocabulary size [10], in this case the
total number of nodes (i.e. the number of entities) is much lower. From a qualitative point of
view, NE and small-world networks share similar topological properties, because in most cases
$\langle C \rangle \gg \langle C \rangle_r$ and $\langle l \rangle \simeq \langle l \rangle_r$.
Even though Zipf’s law may play a role in the observed small-world effect, the equivalence
between these two effects is not straightforward because, in the proposed model, we consider
segments of texts, where a quantity of K entity occurrences does not imply the formation of K
edges. In addition, note that many entity occurrences tend not to be translated into an edge
because of the burstiness effect, which is especially prominent in words denoting named entities
[24]. An example of NE network is shown in Fig. S5 of the SI. As expected, central characters
play a prominent role in the networked model. Another important feature of NE networks is their
modular structure. In Fig. 1 we show the modular structure unfolded with traditional community
structure methods (see also Fig. S4 of the SI). This modular structure has been noted in most of
the studied networks.

In order to show how the network representation might provide complementary information for text
analysis, we studied the problem of identifying the most central entities. As an example, we used
the book Middlemarch, by George Eliot. The following network centrality measurements were
computed: betweenness centrality, PageRank


³ www.gutenberg.org
Table 1: Statistical properties of NE networks. In most of the networks, the clustering
coefficient is higher than the same quantity observed in equivalent random networks. With regard
to the average shortest path lengths, the values observed in real and equivalent random networks
are similar. Concerning the differences across books, we have identified that most fluctuations
arise from the difference in network size (result not shown). However, in specific cases (such as
the value of $\langle k \rangle$ in MAN and PER), the statistical differences are significant, as
a consequence of distinct authors’ styles. Such differences could be explored in further studies
aiming at identifying authorship in texts.

Book   $N$   $\langle k \rangle$   $\langle C \rangle$   $\langle C \rangle_r$   $\langle l \rangle$   $\langle l \rangle_r$
MAN     44   2.68                  0.192                 0.062                   2.98                  3.83
EMM     56   4.21                  0.411                 0.075                   2.38                  2.80
PER     44   3.27                  0.254                 0.074                   3.19                  3.19
PRI     47   3.87                  0.294                 0.088                   2.87                  2.85
SAS     33   3.51                  0.204                 0.102                   2.80                  2.78
BLH    140   3.12                  0.241                 0.022                   4.90                  4.33
DCP    149   3.30                  0.219                 0.022                   4.27                  4.19
LDR    175   3.58                  0.351                 0.020                   4.87                  4.04
BTW    122   3.05                  0.355                 0.021                   3.83                  4.31
WWL    138   3.65                  0.226                 0.026                   4.32                  3.81
Fig. 1: Community structure obtained from the book Bleak House, a novel by Charles Dickens. The
five largest communities are highlighted. The communities were obtained with the fastgreedy
optimization of the modularity.


and accessibility. The betweenness is a global measurement that computes the number of shortest
paths passing through nodes. The PageRank is a global centrality measurement that considers that
a node is relevant whenever it is connected to other relevant nodes. Finally, the accessibility
is a local measurement which can be understood as an extension of the degree, as it quantifies
the effective number of neighbors at a distance h from the reference node [26]. Mathematically,
the accessibility ($a$) is defined as

$$a^{(h)}(i) = \exp\left( - \sum_j P^{(h)}_{ij} \ln P^{(h)}_{ij} \right), \qquad (8)$$

where $P^{(h)}_{ij}$ is the probability of a random walker to go from node i to node j in h
steps. All three network centrality indices were compared with the raw frequency of appearance of
the respective entities in the books. In Table 2 we show in decreasing order the 10 most relevant
entities obtained with the frequency and the accessibility measurement. The most relevant
entities obtained with the betweenness and PageRank are shown in Table S5 of the SI. As for the
betweenness centrality, the six most relevant entities are the same as the ones obtained with the
simple frequency. However, the entities appearing between the seventh and tenth positions are
low-frequency entities that, nonetheless, are located in privileged network positions. A similar
behavior can be observed for the PageRank index: top relevant entities are also very frequent,
even though some low-frequency entities are important because they are a few hops away from the
most relevant nodes. The most relevant entities captured by the accessibility index (computed for
h = 2) turned out to be less influenced by the frequency. Note, for example, that Farebrother is
the character with the highest effective number of neighbors at the second level, even though it
is only the tenth most frequent entity in the book. In fact, a correlation analysis of less
frequent entities showed that the relevance of entities according to topological indices may not
be predicted by frequency alone (result not shown). By no means are we suggesting that network
measurements are more relevant than traditional frequency indices. We rather suggest that network
measurements could be included as an additional feature to analyze e.g. the complexity of book
plots and literary movements [27-30].
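
To make the accessibility measurement concrete, the sketch below computes it in plain Python as the exponential of the entropy of the h-step transition probabilities. It assumes an unbiased random walk on a network given as an adjacency matrix in which every node has at least one neighbor; the literature also uses other walk dynamics (for instance self-avoiding walks), which this sketch does not cover. The function name and the toy star network are hypothetical.

```python
from math import exp, log

def accessibility(adj, h):
    """Accessibility of every node at level h for an unbiased random walk.

    Assumption: `adj` is a square list of lists and no row sums to zero.
    """
    n = len(adj)
    # One-step transition probabilities: each row is normalized by its strength.
    P = [[adj[i][j] / sum(adj[i]) for j in range(n)] for i in range(n)]
    Ph = P
    for _ in range(h - 1):
        # Multiply by P once more to obtain the next power of the transition matrix.
        Ph = [[sum(Ph[i][m] * P[m][j] for m in range(n)) for j in range(n)]
              for i in range(n)]
    # Exponential of the entropy of each row; zero probabilities contribute nothing.
    return [exp(-sum(p * log(p) for p in row if p > 0)) for row in Ph]

# Toy star network: node 0 is the hub, nodes 1 to 3 are leaves.
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
level_1 = accessibility(star, 1)
level_2 = accessibility(star, 2)
```

In the star example, the hub reaches three nodes with equal probability in one step, whereas each leaf reaches three nodes with equal probability only in two steps, which illustrates how a node with few direct neighbors can still be highly accessible at the second level.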


Identifying patterns in NE networks: application to anaphora resolution. To show how the proposed
network representation might be useful in a real application, we addressed the anaphora (or
co-reference) resolution problem (see a brief description of traditional methods in Section S4 of
the SI). In this task, we aim at identifying the entity related to an undefined reference in the
text. To illustrate the problem, consider the following example: “The ordinance was published by
Ricardo and Borsari, who are responsible for the management of water resources in the state. In
the document, he points out the critical situation of water in the region.” Note that, in this
case, the objective of a system aiming at anaphora resolution is to relate “who” and “he” to
either “Ricardo” or “Borsari”.

The following methodology was applied to tackle the anaphora resolution problem. Differently from
the training phase (see Section “Extracting networks from books”), in this phase (classification
phase), each unknown entity (e.g. “who” or “he” in the previous example) was modelled as a
different node in the network. As such, the
Table 2: Rank of most relevant entities in Middlemarch (a novel by George Eliot) according to two
centrality measurements. While frequency is obtained from simple word counts, the accessibility
measurement is obtained from NE networks. Note that the most frequent entities assume different
levels of relevance in the network according to the accessibility.

#    Frequency            Accessibility
1    Tertius Lydgate      Camden Farebrother
2    Dorothea Brooke      Tertius Lydgate
3    Edward Casaubon      Rosamond Vincy
4    Fred Vincy           Mr. Tyke
5    Rosamond Vincy       Edward Casaubon
6    Nicholas Bulstrode   Dorothea Brooke
7    Mary Garth           Caleb Garth
8    Celia Brooke         Nicholas Bulstrode
9    James Chettam        Fred Vincy
10   Camden Farebrother   Laure


network comprises nodes belonging to the set of unknown ($E_?$) and known ($E_!$) entities. In
other words, the final networks in this task contain vertices of different linguistic provenance,
i.e. while some vertices represent known named entities, others model unknown references. We
measured all possible pairwise similarities between entities $v_i \in E_?$ and $v_j \in E_!$.
Therefore, unknown references were characterized by their similarity to the other entities in the
book. In the baseline approach based on simple co-occurrence statistics (COO), which does not
consider the global topology of networks, each $v_i \in E_?$ is considered to be similar only to
those entities $v_j \in E_!$ that appeared in the same text window. Mathematically, the
similarity $f_{ij}$ between $v_i \in E_?$ and $v_j \in E_!$ in the baseline approach is computed
as defined in equation (1).

The co-occurrence approach is based on the premise that the references to unknown entities tend
to occur in the same context, thus surrounding characters, locations and organizations tend to be
the same. However, the co-occurrence approach only takes into account local information to
measure the similarity between entities, which may lead to a large loss of relevant information.
A more informed approach can consider the full network topology to quantify the pairwise
similarities and thus improve the characterization of the context around an entity. Here, we used
the Katz similarity, which is defined as:

$$K = \sum_{i=0}^{\infty} (\alpha A)^i = (I - \alpha A)^{-1}, \qquad (9)$$

where I is the identity matrix and $\alpha$ is a positive constant. If $\lambda_1$ is the leading
eigenvalue of A, $\alpha$ must satisfy $\alpha < 1/\lambda_1$ if equation (9) is to converge. We
also considered a variation of the Katz similarity ($\kappa$) that does not consider the bias
toward highly connected nodes. Mathematically, the similarity $\kappa$ between two nodes
$v_i \in E_?$ and $v_j \in E_!$ is given by:

$$\kappa_{ij} = \frac{\alpha}{D_{ii}} \sum_k A_{ik} \kappa_{kj} + \delta_{ij}, \qquad (10)$$

where $\delta_{ij}$ accounts for the self-similarity term and is defined as $\delta_{ij} = 1$ if
$v_i = v_j$, and $\delta_{ij} = 0$ otherwise. In matrix terms, $\kappa$ is written as

$$\kappa = (D - \alpha A)^{-1} D, \qquad (11)$$

where D is the diagonal matrix with elements $D_{ij} = f_i$ if $v_i = v_j$, and $D_{ij} = 0$
otherwise. Note that both measurements $K$ and $\kappa$ make use of the full network structure to
compute pairwise similarities. This is evident when one interprets the quantity $A^m$ as the
matrix storing the number of paths of length m between two nodes. An example of identification of
unknown entities in a toy network is discussed in Section S3 of the SI.
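
The two similarity measurements can be sketched in plain Python by iterating their defining recursions instead of inverting matrices. This is only an illustration under stated assumptions: the function names, the number of iterations and the toy inputs are hypothetical, the adjacency matrix is given as a list of lists, and $\alpha$ is assumed small enough for the iterations to converge.

```python
def _identity(n):
    """Return the n x n identity matrix as a list of lists."""
    return [[float(i == j) for j in range(n)] for i in range(n)]

def katz_similarity(A, alpha, n_iter=200):
    """Approximate (I - alpha A)^{-1} by iterating K <- I + alpha A K.

    Assumption: alpha is below the inverse of the leading eigenvalue of A.
    """
    n = len(A)
    I = _identity(n)
    K = _identity(n)
    for _ in range(n_iter):
        K = [[I[i][j] + alpha * sum(A[i][m] * K[m][j] for m in range(n))
              for j in range(n)] for i in range(n)]
    return K

def normalized_katz(A, d, alpha, n_iter=200):
    """Approximate (D - alpha A)^{-1} D, where D is diagonal with entries d[i].

    The update kappa <- I + alpha D^{-1} A kappa has this matrix as its fixed
    point. Assumption: every d[i] is positive and the iteration converges.
    """
    n = len(A)
    I = _identity(n)
    kappa = _identity(n)
    for _ in range(n_iter):
        kappa = [[I[i][j] + (alpha / d[i]) * sum(A[i][m] * kappa[m][j] for m in range(n))
                  for j in range(n)] for i in range(n)]
    return kappa

# Toy network with two connected nodes.
A = [[0, 1], [1, 0]]
K = katz_similarity(A, 0.5)
kappa = normalized_katz(A, [2, 2], 0.5)
```

For an unknown reference, the candidate with the largest similarity entry would then be a natural choice, although the classification procedure actually used in the experiments is the supervised scheme described in the SI.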

In our real dataset, we addressed the anaphora resolution task where two possible entities are
candidates for an unsolved reference. This problem can be modelled as a supervised classification
task [31] with two possible classes (see definitions in Section S1 of the SI). The pairs of
evaluated entities are shown in the second column of Table 3. The best average accuracy rates
obtained in our selected dataset are also shown in Table 3. For each pair of entities, we show
the performance obtained with the traditional method based on co-occurrence statistics (COO, see
equation (1)) and with methods making use of network similarity measures, as defined in equation
(9) (Katz similarity). The results obtained with the normalized Katz similarity (see equation
(11)) were no better than the ones obtained with the non-normalized version, as shown in Table S4
of the SI. Note that the use of global network information was able to improve the
characterization of unsolved references, as revealed by the higher accuracy obtained with
non-local network measurements when compared to the traditional method based on simple
co-occurrence statistics. This means that the co-occurrence with specific entities might be
useful to discriminate unknown entities whenever the test instance tends to appear in a community
dominated by a specific entity. In fact, a systematic error analysis performed in our dataset
revealed that most of the errors occur when distinct entities in the training dataset appear in
the same community. Another recurrent error occurs when the test instance appears between two
communities dominated by distinct entities (see Section S5 of the SI). This result confirms the
importance of the patterns unveiled by the proposed networked representation, which were hidden
from traditional models.

A systematic comparison with other traditional anaphora resolution methods not relying on any
network information was also performed, and the results are provided in Table S6 of the SI. In
general, our networked approach performed better than other statistical approaches. In light of
the results obtained with NE networks drawing on the Katz measurement to quantify similarity
between unknown references, we advocate that topological
Table 3: Accuracy rate obtained when identifying references using the traditional model based on
simple co-occurrence statistics (COO, as defined in equation (1)) and the proposed networked
approach (see equation (9)). The results obtained with equation (11) are shown in Table S4 of the
SI. The results obtained with traditional co-occurrence techniques not relying on networked
information are shown in Table S6 of the SI.

Book   Entities                 COO     Katz
BLH    Esther and Caroline      69.0%   81.7%
BLH    Richard and Ada          58.0%   75.0%
EMM    Emma and Harriet         51.7%   86.7%
EMM    Emma and Jane            66.0%   86.0%
JUD    Jude and Richard         59.2%   87.7%
LDR    Arthur and Mr. Pancks    55.9%   81.7%
MMA    Edward and Dorothea      50.0%   87.0%
MMA    Rosamond and Lydgate     52.0%   84.8%
WWL    Felix and Lady Carbury   64.0%   84.0%
WWL    Francis and Eleanor      52.0%   80.0%


information could be used to complement more informed 
approaches based on deep language analysis. 

Conclusion. — In this study, we have introduced a model to form named entity networks, i.e.
networks whose nodes denote people, locations and organizations. Unlike current networked
representations of written texts, our model focused on the complexity arising from non-trivial
co-occurrences between entities alone. In other words, we disregarded linguistic/stylistic
influences of the language on the construction of the networks. From this point of view, NE
networks can be understood as a complementary form of text representation. A topological analysis
performed on novels revealed that NE networks display high clustering and low typical shortest
path lengths, a behavior similar to that found in other textual and non-textual networks. Another
interesting finding arising from the topological analysis of our model is that NE networks can
unveil patterns that cannot be unveiled with traditional methods based on simple word count
statistics. This was clear in the application of NE networks for identifying unknown references
in texts. Particularly, the textual characterization relying upon NE networks outperformed the
traditional representation based on statistical co-occurrence analysis. We believe that the
performance of applications using NE networks could be improved with further development of
enhanced automatic entity recognizers, which could treat referential opacities [33] and entities
with multiple aliases (e.g. “morning star” and “evening star”) [34].


NE networks turned out to play a complementary role in the characterization of written texts.
While traditional approaches neglect the rich information underlying networked representations,
our model is able to capture this type of information concerning the interactions between
relevant entities. Given the variety of current applications and representations making use of
networks in texts, we believe that the use of NE networks might be useful to recover
language-independent patterns. From a practical point of view, the model could also lead to
improved performance in recognizing styles, authorship, quality and plagiarism at a higher level
of abstraction.